Faculty of Science, Technology, Engineering and Mathematics 
M249 Practical modern statistics 


The Open University


M249 


TMA 01 2017J


Cut-off date 6 December 2017 


You can submit each TMA either by post or electronically using the online 
TMA/EMA service. Please read the guidance under the ‘Assessment’ tab on 
the M249 website. 


Each TMA is marked out of 100. The marks allocated to each part of each 
question are indicated in brackets in the margin. Your overall score for each 
TMA will be the sum of your marks for all questions in that TMA. 


Copyright © 2017 The Open University WEB 05690 3 
1.1 


General advice on TMAs for M249 


Here are a few tips that you should bear in mind when answering TMA 
questions for M249. 


You will often be asked to do several things within the same part of a 
question, so you should make sure to read the question carefully so as 
not to miss anything out. 


You should keep all your answers brief. For example, when asked to 
summarize results, you should write a few sentences stating the key 
findings, including relevant numerical values rounded appropriately, 
and give a brief interpretation of these findings. The precise length of 
your answer will depend on the context, but you should aim to be
concise: you will never need more than a few sentences to obtain full
marks. 


Your work should include only your answers to the questions: 
do not include any material that was not asked for in the 
questions. Marks are calculated according to a mark scheme that 
only relates to what has been asked in the question. There are no
extra marks available for work that has not been asked for, no matter 
how good any extra work may be. 


If a question mentions the M249 software or a data file, then you 
should use the software to obtain your answer. 


In some questions, you might be asked specifically to do calculations 
by hand without using SPSS (as you would in the examination). In 
this case, you should not provide output from the M249 software 
(although you may use the M249 software to check your answers). 


When asked to do calculations by hand, you should include your
working. If you don’t, and you make a mistake, then your tutor will not
be able to provide any feedback about where you went wrong. Leaving 
out your working may also cost you marks if it is specifically asked for. 


When you are asked to supply output from the M249 software, you 
should select appropriate output and insert it into your solution to 
support the point that you wish to make. You should not include 
all the output and rely on your tutor to sort out which parts 
relate to which questions. 


It is not sufficient to provide only output from the M249 software. You 
should always accompany the output that you do provide with a brief 
sentence in English, as if you were writing it for someone who is not 
familiar with the software. 


Best wishes! We hope you enjoy M249. 

TMA 01 Cut-off date 6 December 2017


Questions 1 to 5 below, on Introduction to statistical modelling and Book 1 
Medical statistics, form tutor-marked assignment M249 01. Question 1 (on 
Introduction to statistical modelling) is marked out of 16, Question 2 (on 
Part I of Book 1) is marked out of 26, Question 3 (on Part II of Book 1) is 
marked out of 19, Question 4 (also on Part II of Book 1) is marked out of 16, 
and Question 5 (on Part III of Book 1) is marked out of 23. 


Question 1 — 16 marks 


This question is intended to assess your understanding of the use and
interpretation of graphical and numerical summaries of data, data types, 
significance probabilities and confidence intervals, and your use of SPSS to 
obtain appropriate output. You should be able to answer this question after 
working through the Introduction to statistical modelling. 


In this question you will be required to supply SPSS output for parts (b), (d) 
and (f) only. In parts (b) and (f) you will need to edit the default SPSS plot, 
and you should include only the edited plots in your work. All SPSS output 
should be included in the body of your work at the relevant point, and you 
should include only what is relevant to the question and your answer. 


The Breeding Birds Survey is organized by the British Trust for Ornithology 
to monitor the relative abundance of different species of birds over time and 
in different areas of the UK. Each year, volunteers survey kilometre square 
sites at regular intervals, and note down the numbers of birds of each species 
that they observe on each occasion. In 2001, access to the countryside was 
restricted by a national outbreak of foot and mouth disease. 


The SPSS file rails.sav contains data on counts of coots and moorhens 
collected over the period 1994-2007. There are four variables: year, sites, 
coot and moorhen. 


(a) The variable sites gives the total number of sites surveyed each year. 
How did the restricted access in 2001 due to the national outbreak of 
foot and mouth disease affect the number of sites surveyed? 


(b) The variables coot and moorhen contain, respectively, the annual 
densities of coots and moorhens per square kilometre, over the survey 
sites. Obtain a multiple line plot of coot and moorhen by year (but do 
not include it in your answer). Edit this multiple line plot by adding a 
suitable title and changing the label for the vertical axis to something 
more informative than the default label Value. Include a copy of this 
edited graph with your answer. Comment briefly on the relative 
densities of coots and moorhens, and how they have changed over time. 


(c) For both of the following, state whether it is a discrete or continuous 
random variable: sites and coot. 


(d) Obtain a scatterplot of coot against moorhen, and include a copy of the 
graph with your answer. Comment briefly on the relationship, if any, 
between the two variables. 

(e) Obtain the correlation coefficient between moorhen and coot, and obtain
the significance probability for the test of the null hypothesis that the 
two variables are uncorrelated. What do you conclude from this test? 





(f) Create a new variable diff, containing the differences between the 
annual densities of coots and moorhens. (Part (a) of Exercise 7.1 of 
Introduction to statistical modelling is similar to this.) Obtain a 
histogram of diff, customized so that the bin size is 0.01. Include a 
copy of your final histogram with your answer. 


(g) Obtain the mean and standard deviation of diff, and a 95% confidence 
interval for the mean. (You may assume that any conditions required 
for using the confidence interval are satisfied.) Is it plausible that the 
underlying mean densities of coots and moorhens are the same? Explain 
your reasoning. 


Question 2 — 26 marks 


This question is intended to assess your understanding of case-control and
cohort studies, and the use and interpretation of relative risks and odds 
ratios. You should be able to answer this question after working through 
Part I of Book 1. 


You do not need your computer to answer this question, though you may use 
it to check your answers if you wish. You must show detailed working: an 
answer with no working will be given a maximum of 1 mark. 


(a) A Swedish study was undertaken to investigate the association between 
the father’s age when a child is born and the risk of that child 
developing schizophrenia. Of the 754330 people born in Sweden between 
1973 and 1980, and still alive and resident in Sweden at the age of 16, 
who were included in the study, 42316 were excluded from the analysis 
due to missing data. The people included in the study were followed up 
for a mean of nine years after the age of 16. For each person in the 
study, their father’s age at the time of their birth and whether or not 
they developed schizophrenia were recorded. The data are in Table 1. 


Table 1 Father’s age at child’s birth and schizophrenia 


                     Developed schizophrenia
Father’s age > 30    Yes    No        Total
Yes                  348    336883    337231
No                   291    374492    374783


(i) Which is the exposure and which the disease? What type of study 
is this? Explain your answer. 


(ii) Calculate by hand the relative risk for the association between the 
father’s age at a child’s birth and the child developing 
schizophrenia, and obtain a 95% confidence interval for the relative 
risk. Show your working. 


(iii) Summarize your results. 


When asked to summarize results, you should write a few sentences, 
quoting relevant numerical summaries appropriately rounded, and 
giving your reasoned interpretation of results. 
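
As an optional check on a hand calculation like the one in part (a)(ii), the following is a minimal sketch of one standard large-sample approach: the relative risk is the ratio of the two risks, and the confidence interval is formed on the log scale. The function name `relative_risk_ci` is hypothetical, and the sketch assumes that the total for the unexposed row of Table 1 is 374783 (that is, 291 + 374492). It is not a substitute for showing your working.

```python
import math

def relative_risk_ci(a, n1, c, n2, z=1.96):
    """Relative risk of exposed vs unexposed, with a large-sample CI.

    a / n1: number with the disease / total among the exposed
    c / n2: number with the disease / total among the unexposed
    z:      standard normal quantile (1.96 for a 95% interval)
    """
    rr = (a / n1) / (c / n2)
    # Approximate standard error of log(RR).
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Counts from Table 1 (father's age > 30: Yes = exposed, No = unexposed).
# Assumption: the unexposed row total is 374783 = 291 + 374492.
rr, rr_lower, rr_upper = relative_risk_ci(348, 337231, 291, 374783)
print(f"RR = {rr:.2f}, 95% CI ({rr_lower:.2f}, {rr_upper:.2f})")
```

An interval that lies wholly above 1 would suggest a positive association between the exposure and the disease.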

(b) A case-control study was undertaken in the USA to investigate the
association between head injuries while cycling and the wearing of
helmets. The study included 235 cases who were each admitted to one
of five hospitals with head injuries received while cycling. The control
group consisted of 433 people who received treatment at the same
hospitals for cycling injuries not involving the head. Each cyclist
included in the study was asked whether or not they were wearing a
helmet at the time of their accident. The data are given in Table 2.





Table 2 Head injuries and helmets 


Helmet Head injury cases Controls 


Yes 16 104 
No 219 329 
Total 235 433 


(i) State one advantage of the case-control method for this study, 
compared to the cohort method. 


(ii) Calculate by hand the odds ratio for the association between 
helmet use and head injury, and obtain a 95% confidence interval 
for the odds ratio. 


(iii) By hand, test the null hypothesis of no association between head
injury and safety helmet use using the chi-squared test.


(iv) Summarize your results. 


(v) An estimate of relative risk is required. How might this be 
approximated for this case-control study? 
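
The calculations in parts (b)(ii) and (b)(iii) can be checked with a short sketch such as the one below. It assumes the usual large-sample confidence interval for the odds ratio on the log scale, and a Pearson chi-squared statistic without continuity correction referred to a chi-squared distribution with 1 degree of freedom. The function names are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a large-sample CI formed on the log scale."""
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) and its p value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Table 2: a = helmet, case; b = helmet, control;
#          c = no helmet, case; d = no helmet, control.
or_hat, or_lower, or_upper = odds_ratio_ci(16, 104, 219, 329)
chi2, p_value = chi_squared_2x2(16, 104, 219, 329)
print(f"OR = {or_hat:.2f}, 95% CI ({or_lower:.2f}, {or_upper:.2f})")
print(f"chi-squared = {chi2:.1f}, p = {p_value:.2g}")
```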

Question 3 — 19 marks


This question is intended to assess your ability to enter stratified data in
SPSS, and your understanding of the use and interpretation of methods for 
analysing stratified data. You should be able to answer this question after 
working through Part II of Book 1. 


In this question you will be required to supply SPSS output for part (a) only, 
though you will be expected to use SPSS to answer the rest of the question. 
All SPSS output should be included in the body of your work at the relevant 
point, and you should include only what is relevant to the question and your 
answer. 


A study was undertaken to investigate how physical activity by adolescents 
impacts on other aspects of their lives. In this question, the association 
between level of physical activity (the exposure, classified as Low or High) 
and daily time spent playing computer games (the outcome, classified as Low 
or High) is considered. Table 3 shows data on these variables, stratified by 
gender. 





Table 3 Physical activity and computer gaming 


Boys
                          Time spent on computer games
Physical activity level   High    Low    Total
High                      271     223    494
Low                       146     122    268

Girls
                          Time spent on computer games
Physical activity level   High    Low    Total
High                      45      229    274
Low                       86      307    393


(a) Create an SPSS data file containing the data in Table 3. Use the 
following variable names and value labels: 


count
exposure (with value labels high and low)
outcome (with value labels high and low)
gender (with value labels boys and girls).


Use SPSS to obtain a crosstabulation of your data, and include the 
SPSS output in your answer. [3] 


(b) For boys the stratum-specific odds ratio, with 95% confidence interval, 
for the association between physical exercise and computer gaming is 
OR_boys = 1.015, (0.753, 1.369). Obtain and report the corresponding
stratum-specific odds ratio for girls. Also obtain the unadjusted odds 
ratio and 95% confidence interval. Comment carefully on all these 
results, and on the role of gender in this analysis. [6] 


(c) Carry out Tarone’s test for homogeneity of the odds ratios. Write down 
the test statistic and the p value, and interpret your findings. [2] 

(d) Obtain and report the Mantel-Haenszel odds ratio and its
95% confidence interval for the association between physical exercise and
computer gaming, adjusted for gender. Comment briefly on the value of
this odds ratio compared to the stratum-specific odds ratios in part (b).


(e) Perform the Mantel-Haenszel test for no association between physical
exercise and computer gaming, allowing for the effect of gender. Write
down the test statistic and p value, and interpret your results.


(f) Summarize the results of the analysis as a whole.
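
Although this question expects SPSS to be used, the point estimates can be cross-checked with a few lines of code. The sketch below assumes the standard Mantel-Haenszel estimator (a weighted ratio of the cross-products, summed over strata) and treats High physical activity as the exposed category and High gaming time as the outcome, which reproduces the odds ratio of 1.015 quoted for boys. Confidence intervals and tests are not covered, and the names used are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio over 2x2 strata.

    Each stratum is (a, b, c, d):
    a = exposed with outcome,   b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    strata = list(strata)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Table 3, with High activity as exposed and High gaming time as the outcome.
table3 = {"boys": (271, 223, 146, 122), "girls": (45, 229, 86, 307)}

# Stratum-specific odds ratios.
stratum_or = {g: (a * d) / (b * c) for g, (a, b, c, d) in table3.items()}

# Unadjusted (crude) odds ratio from the table collapsed over gender.
crude = [sum(cells) for cells in zip(*table3.values())]
crude_or = (crude[0] * crude[3]) / (crude[1] * crude[2])

mh_or = mantel_haenszel_or(table3.values())
print(stratum_or, round(crude_or, 3), round(mh_or, 3))
```

Comparing the crude value with the stratum-specific and adjusted values is one way to see whether gender is acting as a confounder.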


Question 4 — 16 marks


This question is intended to assess your understanding of the use and 
interpretation of methods for analysing 1-1 matched case-control studies, and
of methods for analysing dose-response data. You should be able to answer 
this question after working through Part II of Book 1. 


(a) A case-control study was conducted in Western Australia on the
association between mobile phone use while driving and car crashes.
One of the analyses in this study involved 236 drivers who owned a 
mobile phone and who were admitted to hospital following a car crash. 
For each of these individuals, the case interval was defined as the 
ten-minute period immediately preceding the crash, and the control 
interval was defined as a ten-minute interval arising at a comparable 
time while the individual was driving seven days before the crash. 
Mobile phone company records were used to determine whether the 
individual had been using the phone during either the control interval or 
the case interval. 





This study design is a 1-1 matched case-control study. (It is slightly 
different from other case-control studies in that case and control 
intervals are sampled from the same individual. This has no bearing on 
the question. Such studies are called case-crossover studies.) The data 
are given in Table 4. 


Table 4 Mobile phone use in matched case and control intervals 


Control interval 
Used phone Did not use phone 


Case Used phone 
interval Did not use phone 





(i) Obtain the Mantel-Haenszel odds ratio for the association between 
mobile phone use and hospitalization following a car crash, and 
calculate the 95% confidence interval for the odds ratio. 


(ii) Summarize and interpret your results. 


(iii) Some individuals who were hospitalized following car crashes 
refused to take part in the study. How might this have resulted in 
selection bias? Given that mobile phone use while driving was 
illegal in Western Australia at the time when this study was 
conducted, make an informed guess about the likely direction of 
such bias. Explain your reasoning carefully. 

(b) A 1-1 matched case-control study was undertaken among women to
investigate the effect of oestrogens on the risk of endometrial cancer.
The investigators identified 63 cases of endometrial cancer occurring in 
a retirement community near Los Angeles, California (USA), from 1971 
to 1975. Exposure is considered as ‘ever having taken any oestrogen’. 
Each case was matched to a single control: each control was living in 
the community at the time the case was diagnosed, born within one 
year of the case, had the same marital status as the case, and entered 
the community at approximately the same time as the case. 


The 63 cases were then matched to a further three controls each (so four 
controls altogether each). The oestrogen amount (in mg/day) was 
recorded for 59 of the cases and 248 of the controls (the oestrogen 
amount for the remaining women in the study was unknown). The 
results are presented in Table 5. 





Table 5 Oestrogen dose and endometrial cancer 


Oestrogen dose (mg/day)   Cases   Controls
0.626+                    16      19
0.3-0.625                 15      41
0.1-0.299                 16      45
None                      12      143
Total                     59      248


(i) Without using SPSS, calculate the dose-specific odds ratios for the 
association between oestrogen exposure and endometrial cancer, 
relative to the oestrogen dose ‘None’. 


(ii) The SPSS file endometrial-cancer.sav contains the data in 
Table 5. Open this file and carry out the chi-squared test for no 
linear trend between the dose of oestrogen and the risk of 
endometrial cancer. Report the test statistic and the p value. 
What do you conclude about the presence of a dose-response 
relationship? 





(iii) Summarize and interpret your results. 
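
Part (b)(i) asks for a hand calculation, but the arithmetic can be verified afterwards with a sketch like the following. Each dose-specific odds ratio compares the odds of exposure at that dose with the odds at the reference dose ‘None’. The sketch assumes that the number of controls in the 0.3-0.625 row is 41, the value implied by the controls total of 248. The function name is hypothetical.

```python
# (cases, controls) by oestrogen dose in mg/day, from Table 5.
table5 = {
    "0.626+": (16, 19),
    "0.3-0.625": (15, 41),  # assumption: 41 is implied by the controls total of 248
    "0.1-0.299": (16, 45),
    "None": (12, 143),
}

def dose_specific_odds_ratios(table, reference="None"):
    """Odds ratio for each dose level relative to the reference level."""
    ref_cases, ref_controls = table[reference]
    return {
        dose: (cases * ref_controls) / (controls * ref_cases)
        for dose, (cases, controls) in table.items()
        if dose != reference
    }

dose_or = dose_specific_odds_ratios(table5)
for dose, value in dose_or.items():
    print(f"{dose}: OR = {value:.2f}")
```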

Question 5 — 23 marks


This question is intended to assess your understanding of ideas and methods
used in the design and analysis of randomized controlled trials and in
meta-analysis. You should be able to answer this question after working
through Part III of Book 1. You do not need your computer to answer this
question. 


(a) A percutaneous endoscopic gastrostomy (PEG) is used to maintain 
nutrition in patients with eating difficulties. However, patients receiving 
a PEG can be vulnerable to infection. So there is a need for patients to 
be treated with an antibiotic prior to receiving a PEG. 


A randomized controlled trial was carried out to assess the effectiveness 
of a new antibiotic treatment for patients undergoing a PEG. Members 
of the intervention group were given the new antibiotic treatment, and 
members of the control group were given a standard antibiotic 
treatment. The outcome of interest is occurrence of a clinically 
identifiable wound infection. 


(i) Formulate the trial hypotheses clearly. 


(ii) Suppose that the trial was designed to have 80% power to identify 
a reduction in the numbers of patients with a wound infection from 
50% to 45%, with significance level 5%. If the trial comprised two
groups of equal size, calculate the total sample size required for the 
trial. 


(iii) During the follow-up period, the doctors were unaware of whether 
a person was in the intervention group or the control group when 
determining whether or not the patient had a clinically identifiable 
wound infection. Explain why this is desirable. 


(iv) Patients were randomized to the control or intervention group. 
Explain briefly why it was important to randomize. 


(v) Figure 1 shows the flow chart for the trial. 


                    234 patients randomized

118 Control Group                  116 Intervention Group
18 excluded                        16 excluded
100 completed study                100 completed study
53 clinically identifiable         44 clinically identifiable
wound infections                   wound infections


Figure 1 Flow chart for the PEG trial


Draw up tables for an intention-to-treat analysis and for a 
per-protocol analysis. Obtain the odds ratio for each analysis, and 
interpret these values. 
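
For part (a)(ii), the sketch below shows one common approximation for the sample size needed to compare two proportions with equal group sizes and a two-sided test. This is an assumption: the formula given in Book 1 may differ in detail, and you should use the Book 1 formula in your answer. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate group size for comparing two proportions (two-sided test).

    Assumption: the common normal-approximation formula with a pooled
    variance under the null hypothesis and unpooled variance under the
    alternative. Other textbook formulas give slightly different answers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    )
    n = (z_alpha * null_sd + z_beta * alt_sd) ** 2 / (p_control - p_treatment) ** 2
    return math.ceil(n)

n_per_group = sample_size_per_group(0.50, 0.45)
total_n = 2 * n_per_group
print(n_per_group, total_n)
```

Detecting a difference as small as five percentage points requires a far larger trial than the 234 patients shown in Figure 1.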

(b) Several clinical trials have examined the role of home blood pressure
monitoring in managing hypertension (high blood pressure). In these
studies, people were randomized to the control and intervention groups.
People in the control group had their blood pressure monitored at a
clinic, while people in the intervention group used different methods of
home blood pressure monitoring. A forest plot for the meta-analysis of
six such trials is given in Figure 2. (An odds ratio of greater than 1
indicates that a person monitoring their blood pressure at home was
more likely to have controlled their blood pressure compared to a person
having their blood pressure monitored at a clinic.)


Study     OR      95% CI
1         0.29    (0.05, 1.68)
2         1.24    (0.56, 2.76)
3         0.71    (0.43, 1.19)
4         0.88    (0.36, 2.14)
5         0.36    (0.08, 1.52)
6         0.26    (0.18, 0.38)

Pooled    0.46    (0.35, 0.59)


Figure 2 Forest plot for studies of the effect of home blood pressure 
monitoring in managing hypertension 


(i) Describe the odds ratios obtained in these six studies.


(ii) Write down the pooled odds ratio with its 95% confidence interval.
Describe the effect of home blood pressure monitoring in the
management of hypertension.


(iii) Which trial contributed most to the pooled estimate of the odds
ratio? Explain your answer, and comment on its influence on the
pooled odds ratio.


(iv) The observed value of the test statistic for Tarone’s test for
homogeneity was 20.1. Obtain an approximate p value and
interpret it. Is it reasonable to pool the results of these studies?
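
In the examination you would read an approximate p value for a homogeneity test from chi-squared tables. As a check, the sketch below evaluates the upper-tail probability directly, assuming that the test statistic is referred to a chi-squared distribution with 5 degrees of freedom (the number of studies minus one). The function `chi2_sf_odd_df` is hypothetical and uses a closed form that is valid only for odd degrees of freedom.

```python
import math

def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with odd df.

    Closed form for df = 2m + 1:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_{j<m} x^j / (1*3*...*(2j+1))
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form applies to odd df only")
    series, term = 0.0, 1.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        series += term
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * series

# Assumption: six studies, so 5 degrees of freedom.
tarone_p = chi2_sf_odd_df(20.1, df=5)
print(f"p = {tarone_p:.4f}")
```

A small p value here would be evidence against homogeneity of the odds ratios across the studies.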